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Abstract: We characterize and study variable importance (VIMP) and 
pairwise variable associations in binary regression trees. A key component 
involves the node mean squared error for a quantity we refer to as a max- 
imal subtree. The theory naturally extends from single trees to ensembles 
of trees and applies to methods like random forests. This is useful because 
while importance values from random forests are used to screen variables, 
for example they are used to filter high throughput genomic data in Bioin- 
formatics, very little theory exists about their properties. 
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1. Introduction 

A binary recursively grown tree, T, can be thought of as a map that uniquely 
identifies a multivariable covariate x£ f with tree node membership. Here x is 
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a d-dimcnsional vector x = (xi, . . . , x v ) £ R d , where x v are real valued variables. 
If T has M terminal nodes, then T is the function T : 2£ — >■ {1, . . . , M} defined 
via 

M 

T(x) = mB m (x). (1.1) 

m— 1 

Here B m (x) are 0-1 basis functions that, because of the recursive nature of T, 
partition S£ . 

An important class of binary trees are those associated with Classification 
and Regression Trees (CART), a widely used method in regression and multi- 
class data settings In CART, a binary tree is grown using recursive splitting 
based on node impurity. The best split for a node is found by searching over 
all possible binary splits for each variable, and identifying the split for a vari- 
able leading to the greatest reduction in tree node impurity. Binary trees are 
also implicitly used in Random Forests Q, a popular ensemble technique used 
for prediction. Rather than using a single tree, random forests constructs an 
ensemble predictor by averaging over a collection of binary trees. Each tree is 
grown from an independent bootstrap sample of the data, where at each node 
of the tree a randomly selected subset of variables of size mtry are chosen as 
candidate variables to split on. Averaging over trees, in combination with the 
randomization used in growing a tree, enables random forests to approximate a 
rich class of functions while maintaining low generalization error. This enables 
random forests to adapt to the data, automatically fitting higher order interac- 
tions and non- linear effects, while at the same time keeping overfitting in check. 
This has led to a great interest in the method and applications in many fields. 

While CART and random forests are often used for exploratory data analy- 
sis, they can also be used to select variables and reduce dimensionality. This is 
done by ranking variables by some measure of importance and removing those 
variables with low rank. Variable importance (VIMP) was originally defined in 
CART using a measure involving surrogate variables (see Chapter 5 of [H). Def- 
initions in terms of mean overall improvement in node impurity for a tree have 
also been proposed. In regression trees, node impurity is measured by mean 
squared error, whereas in classification problems, the Gini index is used [3|. The 
most popular VIMP method to date, however, adopts a prediction error ap- 
proach involving "noising-up" a variable. In random forests, for example, VIMP 
for a variable x v is the difference between prediction error when x v is noised 
up by permuting its value randomly, compared to prediction error under the 
original predictor 0, 0, [H, 0] • This type of definition has become quite popu- 
lar. For example, there is a growing trend in Bioinformatics where importance 
values from forests arc used as a filtering method to reduce dimensionality in 
high-throughput genomic data 

0, IS IS El- 

Although VIMP has been studied in general settings [HI], there is still a 
great need for a systematic theoretical study focusing on trees. In this paper we 
provide such a treatment for the class of binary regression trees and their forests. 
This applies to data settings where the outcome Y is continuous. We define 
rigorously VIMP as well as a measure of association between pairs of variables, 
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and derive theoretical properties for such values. We find that something we call 
a maximal subtree and its node mean squared error play a fundamental role in 
our results. 

It should be noted carefully that the definition for VIMP adopted here differs 
from that used in random forests. The permutation method used by random 
forests is complex and proved too difficult to analyze theoretically in any detail. 
This motivated us to look for an approach simpler to study, but with similar key 
features. Hence, we have adopted a simpler definition whereby VIMP is defined 
as the difference between prediction error under noising and without noising, 
where noising up a variable corresponds to what can be thought of as a random 
left-right daughter node assignment. Section [3] provides a rigorous definition of 
this process and discusses in detail how this method compares with random 
forests. 

2. Binary regression trees 

We begin by first defining a binary regression tree. 

Definition 1. Let «5f = {(Y^x,) : i = 1, . . . , n} denote the learning data. A 
binary tree T grown from Jzf is a tree grown using successive recursive binary 
splits on variables of the form x v < c and x v > c where split values c are chosen 
from the observed values of Xj . The terminal values a m for a terminal node of T 
is the average of those Yi with Xj having node membership m. In other words, 
a m = Er=i I{T(^) G m}Y/j:ti I{T(*i) G "»}■ 

A binary regression tree (hereafter simply refered to as a binary tree) must 
be of the form (jl.ip . Moreover, because of the nature of recursive partitioning, 
the basis functions B m (x) in T are product splines of the form: 

1=1 

Here L m are the number of splits used to define £? m (x). Each split I involves a 
variable, denoted by x;( m ) for the /(m)-th coordinate of x, and a splitting value 
for this variable denoted by Q )7n . The 

s i,m arc ±1 binary values, and for any 
given x, [x]+i = l{x > 0} and [x]_i = l{x < 0}. One can think of B m (x) as a 
0-1 function that is 1 only if x sits inside the hyper-rectangular region defined 
by xu m ) an d the splits ci iTn . 

As mentioned in the Introduction, recursive binary partitioning ensures that 
the resulting decomposition of 9C induced by the basis functions is a non- 
overlapping partition. Hence, an important property possessed by 2? m (x) is 
orthogonality. Thus, 

B m (x)B m , (x) = if m ^ to'. (2.1) 
This property will play a key role in our theoretical development. 
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3. A surrogate VIMP 

We define the VIMP for a variable x v as the difference between prediction error 
when x v is "noised up" versus the prediction error otherwise. To noise up x v 
we adopt the following convention. To assign a terminal value to a case x, drop 
x down T and follow its path until either a terminal node is reached or a node 
with a split depending upon x v is reached. In the latter case choose the right or 
left daughter of the node with equal probability. Now continue down the tree, 
randomly choosing right and left daughter nodes whenever a split is encountered 
(whether the split depends upon x v or not) until reaching a terminal node. 
Assign x the node membership of this terminal node. 

The left-right random daughter assignment is designed to induce a random 
tree that promotes poor terminal values for cases that pass through nodes that 
split on x v . In fact, one can simply think of the noising up process equivalently 
in terms of this random tree, which we shall denote by T v . The prediction 
performance of T v is intimately tied to the VIMP of x v , which is directly related 
to the location of x v splits in T. The more informative x v is, the higher up x v 
splits will appear in T, and consequently the more perturbed T v becomes relative 
to T. This results in a loss of prediction accuracy for T v and a high VIMP for 

Xy • 



3. 1 . Comparison to random forests 

While our noising up procedure is different than the current method employed 
in random forests, our procedure is designed to approximately mimic what is 
seen in practice. In random forests, VIMP is calculated by noising up a variable 
by randomly permuting it. A given x v is randomly permuted in the out-of-bag 
(OOB) data (the data not selected by bootstrapping) and the noised up OOB 
data is dropped down the tree grown from the in-bag data (bootstrap data). 
This is done for each tree in the forest and an out-of-bag estimate of prediction 
error is computed from the resulting predictor. The difference between this 
value and the out-of-bag error without random permutation is the VIMP of 
x v . Large positive values indicate x v is predictive (since noising up x v increases 
prediction error), whereas zero or negative importance values identify variables 
not predictive [2j. 

Often one finds in random forests that variables that tend to split close to 
the root have a strong effect on prediction accuracy [Hj]. When such variables 
are randomly permuted, each tree is perturbed substantially, so when out-of- 
bag data is dropped down the tree, the resulting predictor has poor prediction. 
Averaging over such predictors results in an ensemble with high error rate and a 
large VIMP for the variable. Variables tending to split lower down in a tree, on 
the other hand, have much less impact, because as one travels through a binary 
tree the amount of data drops off exponentially fast. Consequently there are 
much fewer observations affected and the noised up predictor does not suffer so 
much. 
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Our left-right noising up process shares with random forests the key feature 
that VIMP is directly impacted by location of the primary split relative to the 
root node. For this reason we believe our approach can provide insight into 
VIMP for random forests. 

However, the analogy is by no means perfect. A nice example of this, sug- 
gested by one of the reviewers of the paper, is as follows. Recall that at each 
node, while growing a tree, random forests randomly selects a subset of mtry 
variables to split on. In a problem with a large number of non-informative vari- 
ables, it is possible that early on in the tree growing process the mtry variables 
for a node are all non-informative. Hence, the eventual split for the node is on 
a non-informative variable, high up in the tree. Trees are, however, highly data 
adaptive and can recover from scenarios like this. Splits further down the tree 
can be on informative variables and the tree can still be predictive. Now consider 
what happens when we drop a case down the tree using the left-right noising up 
process. If the case passes through the node split on the non-informative vari- 
able x v , then it follows a random left-right path through potentially predictive 
splits. Since the node is high in the tree, the resulting random tree T v suffers in 
terms of prediction and x v can appear informative. 

This type of scenario shows that a non-informative variable can appear in- 
formative over a single tree under our noising up process. However, over a forest 
such an effect will most likely be washed out. With many non-informative vari- 
ables, the probability that x v splits high in a tree is small. Since the event has 
low probability, the forest will contain few trees with high x v splits. Averaging 
over trees pushes the VIMP of x v to zero. Moreover, for a single tree, this kind 
of problem can be resolved by slightly modifying the noising up process. Rather 
than using random left-right assignments on all nodes beneath random 
assignments for only those nodes that split on x v . This will impact prediction 
only when x v is informative and not affect prediction for non-informative vari- 
ables. This modified process, in fact, was used in early attempts to study this 
problem. Unfortunately, though, while it was our feeling this type of VIMP 
could compete against the one used by random forests, it was still too complex 
for the kind of detailed theoretical development we were after. 

It is important to emphasize that while deficiencies like those mentioned 
suggest our noising up process may not be practicable in all data examples, this 
does not detract from theory developed under this mechanism. As we define 
shortly, a key theoretical concept in our framework is the notion of a maximal 
subtree. We believe this is key to understanding VIMP, even complicated values 
like those used by random forests. 

3.2. Subtrees 

To discuss maximal subtrees, first note the predictor associated with the original 
tree T can easily be written in closed form. Denote the predictor by /t(x). It 
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follows that 

M 

p,(x.) = ^2 a m B m (x). 

m— 1 

Surprisingly, it is also possible to give an explicit representation for the predictor 
of T v . To do so, we introduce the notion of a subtree and a maximal subtree. 

Definition 2. Call f v a v-subtree of T if the root node of f v has daughters 
that depend upon an x v split. Call T v a maximal w-subtree if T v is not a subtree 
of a larger ^-subtree (see Figure [J). For example, the root node tree T is always 
a maximal w-subtree for some v. 

Suppose that T contains K v distinct maximal w-subtrees, T\ jV , . . . , Tk v , v - 
Let Mk, v be the set of terminal nodes for the subtree Tk,v and define M v = 
Ufe^i Mk, v - Observe that Mk,v are disjoint sets and that M v is the set of terminal 
nodes of T that can only be reached by going through at least one node involving 
a split on x v . The following key Lemma shows that the predictor for T v can be 
written as a deterministic component (conditioned on T) depending on terminal 
nodes not in M v and a random component involving terminal nodes in M v . 




Fig 1. Large box contains a maximal v-subtree, for v = 3. Smaller box containing tree with 
nodes 5, 6 and 7 is also a v-subtree, but it is not maximal. 
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Lemma 1. Let p, v (x) denote the predictor for T v . Then, 

ft v {x) = ^ Qm-Bm(x) + ^ a fc|t ,J{r(x) g M k , v }, (3.1) 

m^M„ k=l 

where a>k,v is the random terminal value assigned by T^_ v under a random left- 
right path through T^ v . Write Pj s:v for the distribution ofa^ v . Note that P^ v 
is conditional on the original tree T . 

Proof. If T(x) ^ M v , then x is assigned the same terminal node as in 
T(x). This explains the first sum in the expression (|3.ip . On the other hand 
if T(x) S M v , then x passes through a root node of a maximal subtree, say 
Tk.y Upon entering this node, x follows a random path through Tk, v until even- 
tually terminating at a node m € Mk tV - Thus, the terminal value for x is the 
random value dk,v assigned by a random left-right path through Tk tV . □ 

3.3. Prediction error 

To formally discuss VIMP it is necessary to first define what we mean by pre- 
diction error (PE). Let g be some loss function. The PE for ft is defined as 

5»(A) = E( 5 (y,A(x))). 

The expectation on the right-hand side is joint over the original learning data 
as well as the independent test data (Y, x). It is assumed that Y can be 
written as a regression model 

y = Mx) + e, (3.2) 

where e has zero mean and variance a 2 > and e is assumed independent of 
/z(x). In a similar way one can define PE for ft v : 

^(A„) = E( 5 (y,A„(x))). 

Note the expection on the right-hand side involves an extra level of integration 
over dk t v via its distribution Pfe jt , . 

We now define the VIMP of x v . In this paper we confine attention to the 
case when g(Y, ft) = (Y — ft) 2 , corresponding to L 2 -loss. The VIMP value for x v 
under L2-I0SS is defined as 

A v = &{ft v ) -&(fi). 
Using standard arguments it is easy to show that under (|3 . 2[) . 



&(fi) = a 2 +E(n(x) -A(x)) 2 . 



(3.3) 
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Now observe by Lemma Q] that (l v (x) can be rewritten as 

A«(x) = A( x ) + X! X! (°' k < v ~ a m )B m (yL). 

k=l m£M k ,v 

Using a similar expansion for ^(fl v ) as (|3.3j) . deduce that the VIMP for i„ is 
A v = E (i?„(x) 2 ) - 2E{i?„(x) [ M (x) - A(x)] } 

where 

K v 

R v (x) = ^ X! ( 5 M ~ a m)-B TO (x). 

k=l m£M ki v 

Clearly A.„ depends heavily upon the relationship between a m and a,k t v 

For 

example, A„ = if a^ v = a m for each m £ Mk, v and each k. In this case, all 
terminal values within a maximal subtree arc equal and randomly assigning a 
terminal value cannot perturb matters. Hence, x v is non-informative as further 
clustering of the data is ineffective. In general, however, noising up x v will 
perturb 

&m,v relative to ct m and result in a positive or negative value for A^. 
Trying to dctangle the resulting properties of A„ in such settings requires a 
theoretical framework which we discuss next. 

4. A theoretical framework for VIMP 

In order to study A^ systematically, it will be necessary to make a simplifying 
assumption about the true signal /i(x). In particular, it will be helpful if we 
assume that /t(x) is a good approximation to //(x). We shall assume the true 
signal has a similar topology to T: 

M 

/u(x) = ^ a m ,oB m (x), (4.1) 

where a mj o are the true, but unknown, terminal values. Observe that /i(x) is 
random because i? m (x) are basis functions determined from the learning data. 
Nevertheless, we shall still think of /i(x) as our "population parameter" because 
the terminal values for /x(x) are fixed and do not depend upon the data. Note 
importantly that (|3.3p still holds even though /i(x) depends upon the data. 

Remark 1. Assumption (|4. 1[) might seem simplistic at first glance, but we'll 
show later in Scction[5]that it is quite realistic when put in the context of forests. 

By exploiting the orthogonality property (|2.1|) of the tree basis functions, we 
can greatly simplify the expression for A„ under assumption (|4.1|) . Consider the 
following theorem. 
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Theorem 1. Assume the true model is 113. 2\) and that holds (for any given 

££). Then, 

a »=4e e *•< 

lfe=l m.EM kiV 

where ir m = P{B m (x) = } and s\ and a^.v denote the variance and mean 
for d k , v under P k , v . 



s l,v + { a k,v - Ira) (o>k,v + ' 



2a,, 



(4.2) 



Proof. Using orthogonality (|2.1[) and assumption l|4.1[) . collect together com- 
mon expressions to deduce that 

A -= E {E E H, m -2r k , m r m ]B m {x)\, 

Lfc=lmSM fc ,„ J 

where rfc jjn = — a m and r m = a TOl o — a m- The expectation sitting outside 
the sum can be partially simplified by integrating over Pk, v while conditioning 
on Jz? and the test data x. A little bit of re-arrangement yields 

A : \ ^2 E [ s m + fe k < v ~ flm ) fe k < v + am ~ 2a ™.°) ^™( x ) r • 

Lfc=l m6M t .„ J 

Continuing to keep Jz? fixed, bring the expectation over x inside the sum. Only 
the term B m (x) depends upon x, from which (|4. 2|) follows. □ 

Theorem [T] has an interesting implication. Let Pk, v ,o be the distribution for 
a,k,v,o- the random terminal node assigned from a random path through the 
maximal subtree Tk, v with terminal nodes replaced by {a mj o : m £ Mk, v }- Call 
the quantity 

6 (k,v) = ^ nmPk,vfl(®k,v,o ~ a m , Q ) 2 , for k = l,...,K v , (4.3) 

raEM ti , 

the node mean squared error for the subtree X^. An immediate corollary to 
Theorem [1] is that the closer /i(x) is to the true signal, the greater the effect 
node mean squared error has on VIMP. 

Corollary 1. If a m — > a m ,o, as one might hope for if sample sizes converge to 
oo, 

A - ->e|£>(M)|. 

Note because 9o(k,v) > 0, all VIMP values are non-negative in the limiting case. 
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Corollary[l]shows in the limit that each maximal u-subtree contributes equally 
to A„ through its node mean squared error. Furthermore, the node mean squared 
error for a subtree T]~ tV is a weighted average with some nodes contributing sig- 
nificantly more to overall mean squared error than others. From (|4.3p . we know 
that 9o(k,v) is a weighted average involving weights 7r m . The value n m is the 
probability that a fresh test x has node membership m under T. If many splits 
are needed to reach m, or equivalcntly if £? m (x) involves a large number of basis 
functions, then 7r m is likely to be small. Hence, nodes far down Tf. jV contribute 
less to 0o(k, v) then nodes closer to its root node. Moreover, this effect is further 
amplified due to Pk,v Under random splitting, if L m splits are needed to reach 
a node m from the root of Tk >v , then P^.v assigns mass 2~ Lm to a m . So again, 
the closer a node is to the root of the subtree, the bigger its impact on node 
mean squared error and A„. This is consistent with what was said earlier in 
Section [3] regarding impact node position has on VIMP. 

5. Forests 

In this section we extend our theory to forests. We begin with a formal definition. 

Definition 3. Let c € = {T(x; b) : b = 1, . . . , B} be a collection of 1 < B < oo 
binary trees grown from The forest fiF, for the collection of trees ^, is the 
ensemble predictor defined by 



where /t(x; 6) denotes the predictor for T(x; b). 

Note unlike trees, forests do not assign node membership values. Forests 
are simply predictors, and they are much better at predicting than individual 
trees. In fact, under mild conditions, one can show that the closure of linear 
combination of binary tree predictors form a complete space. This leads to the 
following theorem (c.f. Proposition 2 of (l3|V 

Theorem 2. Assume x has distribution Vq that is absolutely continuous with 
respect to Lebesgue measure with support Sd on a finite closed rectangle in M. d . 
Then for each real valued function fi 6 L2(Po), and each 5 > 0, there exists a 
forest fiF constructed from binary trees with M > d terminal nodes such that 




Proof. The L 2 (Po)-closure of finite linear combinations of indicator functions 
of rectangles over Sd equals Z^O^o)- Therefore, if we can show that any indica- 
tor function for a rectangle on Sd can be described by a predictor /t(x) for a 
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binary tree T with M > d nodes, the result follows because the closure of finite 
linear combinations of such predictors is L 2 (¥q). It suffices to consider indicator 
functions of the form 

J(x) = I{xi <C 1 ,X 2 <C 2 ,...,Xd< Cd}- 

To construct T, first split x\ at c\. This yields a left daughter node x\ < cy. 
Now split the left daughter at x 2 < c 2 and it's left daughter at X3 < C3 and 
so forth up to Xd < Cd- Label the resulting terminal node m — 1 and denote 
its basis function by -Bi(x). Observe that d splits are needed to create -Bi(x) 
and therefore T has M = d + 1 terminal nodes. Let ai = 1 and set a m = for 
m > 1. Then /i(x) = /(x). □ 



5. J. VIMP for forests 

Suppose that T(x;b), & = 1,...,-B, are distinct binary trees grown from the 
learning data _S? . For example, each tree could be grown as in random forests 
but without bootstrapping the data. Extending our previous notation in an 
obvious way, the forest predictor for the collection of trees is defined to be 

^ B M (i,) 
6=1 m=l 

In light of Theorem [2l it seems quite reasonable to assume that /*f(x) is a 
good approximate to the true signal so long as > d. Thus if each tree in 

the forest is deep enough that they have at least d + 1 terminal nodes, we can 
assume, analogous to (|4.1j) . that 

B M {b) 

^(x) = -^^ a > m (x;fc). (5.1) 

6=1 m=l 

Noising up a variable x v in a forest is straightforward. One simply noises 
up x v for each tree as before. Then one computes the forest predictor, p,p v (jx), 
by averaging over the corresponding perturbed predictors £l v (x; b) . Extending 
notation in an obvious manner, by Lemma [T] we have 

1 3 { K ^ 1 

A F „(x) = -]T ]T a^B m {^b) + Y.^l I {n^b)&M^} . 

Now using a similar argument as in Section [3.31 the VIMP for x v is: 
A F „ = E (R Fv (x) 2 ) - 2E{R Fv (x) [/i(x) - Mx)] }, 

where 

6=1 k=i m eMl b i 
From this we arrive at the following result. 
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Theorem 3. Let RF vy o( x ) be the function i?^(x) in which a m is replaced with 
the value a^ . Assume \3. || ) and 15.1)) holds. If o}$ — > a^ for each m and b, 
then 

{ 1 3 K ^ 1 

A Fv E (^,.o(x) 2 ) <E<^ - ^^9 (M,6) , (5.2) 

w/iere 

<?o(fc,t;,&)= ^ ^ l6 PS,o(41o-^o) 2 
meAI kl 

is the node mean squared error for the kth maximal v-subtree of T(x; b) and 
n m ,b = P{5 m (x; 6) =1|J2?}. 

Proof. The limit follows automatically. The bound on the right-hand side 
of (|5.2p is due to Jensen's inequality and orthogonality (|2.ip . □ 

5.2. Implications for random forests 

The bound in (15. 2p becomes tighter as the trees in the forest become orthogonal 
to one another. Moreover, a key assumption in Theorem ([3]) is that (|5.1|) holds, 
which by Theorem [2] is reasonable so as long as the individual trees in the forest 
are rich enough. Thus, a prescription for VIMP to be positive and fully charac- 
terized by subtree node mean squared error is that the forest should be derived 
from trees rich enough to approximate /x(x) while simultaneously being nearly 
orthogonal to one another. This is quite interesting, because this is the same 
prescription needed to achieve low generalization errors in random forests (see 
Theorem 11.2 of Q). So the right-hand side of (|5.2[) is a very reasonable starting 
point for understanding VIMP in random forests. In fact, in Scction f6.il we will 
show that node mean squared error terms like those appearing in (|5.2| play a 
crucial role in understanding paired associations between variables in forests. 

6. Paired importance values and associations 

For the moment let us return to a study of single trees. We consider the joint 
VIMP of a pair of variables (x v ,x w ) in a binary tree T. Let t = (v,w) and 
write Xt to indicate (x v ,x w ). The paired VIMP for x t is the difference between 
prediction error when xt is noised up versus the prediction error otherwise. 

Noising up a pair of variables is defined as follows. For a case x, drop x down 
T, and if x enters the root node of a maximal v-subtree T^.u, or a maximal 
u;-subtree T/. iW , then initiate a random left-right path until a terminal node is 
reached. This will assign x a random terminal value dk,t with distribution Pk,t- 

Similar to our perturbation for a single variable, one can think of the above 
randomization process as defining a random tree. We refer to this tree as T t . 
Let 

Ti,t, ■ ■ ■ , Tx t ,t 
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be the set of maximal v and w-subtrecs of T . Observe that in some cases a 
maximal w-subtree can be a u>-subtree and a maximal w-subtree can be a v- 
subtree. Thus, {Tk,t} are a collection of not necessarily disjoint trees. To avoid 
overlap, we let 

{f iktV , k = 1, . . . , K*} and {f iuw , 1 = 1,..., ICJ (6.1) 

denote the set of distinct maximal v and iu-subtrees of T. The list (|6.ip is 
constructed such that in the case where a maximal ^-subtree and w-subtree 
intersect, the larger of the two subtrees is used. In particular, note that K* < K v 
and A'*, < K w where K v and K w are the number of distinct maximal t;-subtrees 
and ui-subtrees, respectively. When K* = K v and if* = K w , then the set of 
v and u;-maximal subtrees are disjoint and v and w arc orthogonal variables. 
Otherwise, v and w are associated. Wc will say more about this shortly. 

Similar to Lemma [T] one can show that jit, the predictor for T t , must be of 
the form: 

/t t (x) = Y a m B m (yi) + Y a kiV I{T(x.) 6 M kiV } 

+ J2 ^ J {T{^) e M Ww }. 

Here J(f{v) and J(f(w) are the set of indices {ik ■ k = 1, . . . , K*} and {ii : I = 
1, . . . , K^}, and M t is the set of all terminal nodes for (|6.ip . Note importantly 
that a,k, v and ai. v have distribution Pk iV and Pi. w . 
Dchne the paired VIMP for xt as 

We compare A< to a composite additive effect of the individual variables v and 
w. This will allow us to identify pairs of variables whose "joint effect" differs 
from the sum of their individual components. Thus allowing identification of in- 
teresting paired assocations (for another interesting way to identify interactions 
see 

The marginal importance effect for v and w under an additive effect paradigm 
is defined to be 

^v-\-w — A^ -\- A w . 

The difference between the two sets of importance values is what we call A t , 
the measure of association between 

A t = A t - A v+W = A t - (A v + A w ). (6.2) 

Using similar arguments as in Theorem [TJ we obtain the following characteriza- 
tion of A t . 

Theorem 4. If a m — > a m fi, then under the conditions of Theorem^ 




9 (k,v)+ ]T 9 Q (l,w) 
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Theorem |4] shows in the limit that A t can never be positive. The reason for 
this is that A„ and overcount maximal subtrees since a maximal u-subtrce 
can be a subset of a maximal w-subtree and vice- versa. Consequently, A t , + A w 
can never be smaller than A t . If v and w are orthogonal, then there is no 
overlap between maximal v and w-subtrees and no overcounting occurs. Hence, 
the above sums are null and A t = in the limit. On the other hand, if v and w 
are not orthogonal, then A t has a negative limit. The more negative, the more 
overlap there is between v and w-subtrees, and the stronger the indication of 
an association between the two variables. 

6. 1 . Paired association values for forests 

Association between variables is a little more delicate when it comes to forests. 
We show that association values can not only be negative, but they can also be 
positive in forests. 

Extending notation in an obvious manner, the forest predictor under a noised 
up xt is defined as 

^( x ) = ^e{ E a$B m (x;b)+ «K/{ T M) eM 2} 

6=1 ^m^A/ t (6) keJ*r(v,b) 

+ £ ^{t(x; & ) G <>}1. 

Z6JT(to,&) J 

Moreover, it follows straightforwardly from our previous arguments that the 
VIMP for x t is 

A Ft = E (R Ft (x) 2 ) - 2E{# Ft (x)[/i(x) - Af(x)] }, 

where 

= iE{ E E (41-^)^^) 

+ E E (a&-aS?)iWx;6)}. 
From this we obtain the following characterization. 

Theorem 5. Let i?p t .o(x) be the function Rp t (x.) in which a$ is replaced with 
the value a^ . Likewise, let Rp vt o(pz) and Rp w fi(pL) be defined as in Theorem^ 
Assume i3. 2\) and Ii5.1\) holds. If Om — * &m0 f or sach m and b, then 
A = A Ft - (A Fv + A F J 

-► E (i?F t ,o(x) 2 ) - (E (i?F„.o(x) 2 ) + E (i? Fro . (x) 2 )) . (6.3) 
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Theorem [5] implies a delicate relationship for A t . This involves the way v and 
w interact within a given tree as well as how the variables interact across trees 
in the forest. Consider, first, the extreme case when all basis functions in Rp t 
are mutually orthogonal. This implies a forest where v and w-maximal subtrees 
are disjoint across trees. Then, 

A t ^-Ej-^W ]T 9 (k,v,b)+ ^> w > fo ) 

where 6a(k,v,b) and 9o(l,w,b) are node mean squared errors as defined in Theo- 
rcm[3J Note that the limit on the right-hand side is similar to that in ThcorcmHJ 
The only difference here is that we are averaging node mean squared error terms 
over trees. Just as in Theorem2]the limit of A t cannot be positive. Also, the size 
of At depends upon the behavior of v and w within a tree. For example, if v and 
w are orthogonal, such that for any given tree all maximal v and w-subtrees are 
disjoint, then At = in the limit. In contrast, the more overlap between v and 
ii'-subtrces within a given tree, the more maximal subtrees are overcounted, and 
the more negative At becomes. So just like single trees, large negative values 
indicate a strong association. 

On the other hand, if trees are correlated, then f|6. 3|) can be positive. One 
example is as follows. Suppose that v and w are highly correlated and both 
are influential. Then for any given tree both variables are strong competitors to 
grow the tree. If one variable is selected early on in the tree growing process, 
this might preclude the other variable from being selected. Things might be so 
bad that we could have 50% of the trees grown using v and 50% grown using w. 
So noising up v or w individually will only affect 50% of the trees. However, a 
joint noising up of variables perturbs all trees. Consequently, prediction will be 
substantially worse and we would expect A t to be positive. The next example 
is a good illustration of this. 




7. Air pollution 



For our first illustration we use the well known "air pollution" data set 15j . The 
variables are daily readings of various air quality values measured from May 1, 
1973 to September 30, 1973 in the New York metropolitan area. Solar radiation 
in Langleys (Solar), average wind speed in miles per hour (Wind), maximum 
daily temperature in Fahrenheit (Temp), month of the year (Month) and day of 
month (Day) were all recorded. The outcome is the mean ozone value (Ozone), 
which was transformed by taking its cube-root. 

We analyzed the data using random forests. Computations were carried out 
using the "randomForest" R-package [T3|. In regression settings, randomForest 
computes VIMP by subtracting an out-of-bag mean squared error rate under 
a random permutation of a variable x v from an out-of-bag mean squared error 
rate under the original x v . As we have discussed already, these values resemble 
A„, but are not exactly the same. 
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Since randomForest does not compute paired importance values we adopted 
the following strategy for (|6.2p . To compute an analog of A t , we randomly 
selected 63% of the data to act as training data and set the remaining 37% 
aside as test data (these values represent the average in-bag and out-of-bag 
sample size). A forest with 1000 trees was grown over the training data using an 
mtry value of 3 (see [3| for details about options for randomForest). Then over 
the test data we randomly permuted x v and x w and created a noised up test 
data containing the permuted x v and x w variables. Using the forest predictor 
from the training data we computed the mean squared error over the modified 
test data and then we computed the mean squared error over the unmodified 
(original) test data. We then took the difference between these two values. This 
was repeated 1000 times independently. The average over the 1000 replicates is 
our proxy for A t . Finally, in order to be consistent, we computed proxies for 
A„ and A^, in exactly the same fashion. That is, we created a test data with a 
permuted x v and compared its test mean squared error to the test mean squared 
error for the original test data. The same was done for x w . 

Table [1] presents our results. All results are averaged over replicates. The first 
column, "Paired" , records the paired VIMP for each possible pairwise compari- 
son; the second column lists the additive effects for a given pair of variables (the 
sum of their individual importance values) ; the third column lists the associa- 
tion value (the difference between the paired VIMP and the additive effects); 
and the last column is the association value divided by the out-of-bag mean 
squared error for the training data multiplied by 100. This is a standardized 
value for association expressed in percentages. 

Table 1 

Paired association values using random forests for air pollution data. Values reported are 
averaged over 1000 independent replicates with each forest based on 1000 trees. 





Paired 


Additive 


Association 


Assoc/MSE 


Tcmp:Wind 


0.706 


0.600 


0.106 


11.351 


Temp: Solar 


0.590 


0.529 


0.061 


6.532 


Temp:Month 


0.464 


0.455 


0.008 


0.904 


Temp:Day 


0.465 


0.462 


0.003 


0.298 


Wind:Solar 


0.232 


0.214 


0.017 


2.465 


Wind:Month 


0.142 


0.141 


0.001 


0.137 


Wind:Day 


0.150 


0.148 


0.002 


0.328 


SolanMonth 


0.074 


0.073 


0.000 


0.041 


Solar: Day 


0.084 


0.082 


0.002 


0.486 


Month:Day 


0.006 


0.006 


0.000 


0.046 



Table[T]shows that the largest association value is between the variables Temp 
and Wind. Also interesting, is the second largest association value occurring for 
Solar and Temp. Solar radiation and temperature are known to be related, 
and the fairly large association value in Table [1] confirms this relationship. The 
next largest association value is between Wind and Solar, but this value is 
substantially smaller (as can be seen by the standardized value in the 4th column 
of the table). After this, association values drop off dramatically. 

These results are based on random forests, whereas Theorem [5] applies to the 
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Fig 2. Coplot of ozone values against solar radiation given wind and temperature. 

VIMP under a special type of noising up process for forests. Thus, it is inter- 
esting to see how well the theory extrapolates. To consider this, we generated a 
conditional plot (coplot) of ozone values against solar radiation, conditioning on 
wind and temperature. See Figure [2] As seen, ozone increase with solar radia- 
tion, and the slope of the curves increase as temperature increases for any fixed 
range of wind speeds. This clearly demonstrates an association between solar 
radiation and temperature. Also, looking at ozone and radiation with respect 
to wind, the slope increases as wind increases at high temperatures, but this 
pattern does not apply at lower temperature values. This is clear evidence of a 
Temp:Wind association. These results are consistent with Table [T] 

8. Simulation 

A more direct confirmation of the theory can be illustrated using simulated 
data. For this example, data was simulated from the model 

Y — 30 sin(7rxia;2 ) + 2Q(x 3 - 0.5) 2 + 2O2124 + 5x 5 + e, 

where e was distributed independently from a standard normal distribution. All 
variables x v were drawn from a uniform distribution on [0,1]. This included a 
6th variable, xq, representing noise, which was also part of the x-design matrix. 
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A sample size of n = 100 was drawn. The data was then analyzed using the 
randomForest package exactly as outlined in the previous section. All settings 
were kept the same. The simulation was repeated 100 times independently. The 
averaged values over the 100 simulations are reported in Table [2] Variables 1-6 
are coded as a-f respectively in Table [2] 

Table 2 

Paired association values using random forests for simulated data. Analysis carried out in 





the 


same manner as TableU\ 






Paired 


Additive 


Association 


Assoc/MSE 


a:b 


185.216 


192.870 


-7.654 


-4.231 


ax 


132.347 


132.426 


-0.079 


-0.044 


aid 


140.472 


141.906 


-1.434 


-0.849 


a:c 


133.518 


133.530 


-0.011 


-0.003 


a:f 


131.552 


131.639 


-0.088 


-0.055 


b:c 


61.692 


61.736 


-0.044 


-0.033 


b:d 


71.350 


71.200 


0.150 


0.110 


b:e 


62.822 


62.857 


-0.035 


-0.023 


b:f 


60.990 


60.965 


0.025 


0.021 


c:d 


10.888 


10.870 


0.018 


0.014 


c:e 


2.500 


2.474 


0.026 


0.026 


c:f 


0.599 


0.589 


0.009 


0.007 


d:e 


11.948 


11.934 


0.015 


0.019 


d:f 


10.022 


10.040 


-0.019 


-0.022 


e:f 


1.713 


1.715 


-0.002 


-0.011 



By far, the two largest standarized association values are a:b and a:d, rep- 
resening the interactions between variable x\ and xi and x\ and X4 (see the 
4th column in Table [2]). Note how these values are negative, unlike in our pre- 
vious analysis where all significant associations were positive. Large negative 
values like these, as our theory suggests, is due to high overlap between v and 
tu-subtrees. For example, for x\ and x<i, their interaction must be such that 
whenever x\ splits a node in the tree, there is a high likelihood that xi is split 
underneath this node, and vice-versa. 
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